The research on the human microbiome has gained significant popularity over the past couple of years across both scientific and public communities. This area has revealed considerable knowledge, mainly thanks to next-generation sequencing. Even though technology possibilities evolve faster than ever, and there is an enormous amount of publicly available data, there are not yet precise procedures, methods, and bioinformatics tools to work accurately enough with microbiome data, and there are no standardized recommendations for bioinformatics pipelines to assess the most accurate results.
Here, we demonstrate the potential inaccuracies that can arise from various bioinformatics processing of amplicon 16S RNA sequencing data by performing a comparative analysis of diverse pipeline configurations using mock community samples and biological material originating from biopsy samples obtained during regular colonoscopy from patients with primary sclerosing cholangitis.
This study is based on data from the research on the pathology of primary sclerosing cholangitis (PSC) - see Part 2. The biological material was obtained during routine protocolary colonoscopies from patients who had undergone liver transplantation for PSC and regularly attended follow-up visits at IKEM. The study was performed according to the Declaration of Helsinki, including the changes accepted during the 59th WMA General Assembly, and approved by the Joint Ethics Committee of IKEM and Thomayer Hospital. All patients provided dedicated informed consent.
In addition, mock community samples (ZymoBIOMICS Microbial Community DNA Standard) were used as internal standards. A dataset of 62 samples was created, comprising 22 internal standards, 20 cecum biopsy samples from patients who underwent liver transplantation for PSC and did not develop recurrence and 20 control samples of cecum biopsy of patients who underwent liver transplantation due to alcohol use disorder (AUD).
Part 1 refers to the mock community analysis alongside with negative control samples, to check to what degree there is contamination in individual libraries.
Part 2 refers to the analysis of real biopsy samples.
All samples underwent initial quality checks using FastQC v0.11.9 in combination with MultiQC v1.12. Next, primers were trimmed using CutAdapt v3.5, with any untrimmed reads discarded. Regarding bioinformatics processing, 28 different parameter (see tables below) settings across three pipelines for reads denoising, namely DADA2, VSEARCH-UNOISE3, and Deblur, were employed to process the data - see individual sections for more details.
library(openxlsx)
dada2_settings <- read.xlsx("combinations_settings.xlsx",sheet = "dada_combinations",rowNames = TRUE)
dada2_settings
vsearch_settings <- read.xlsx("combinations_settings.xlsx",sheet = "vsearch combinations",rowNames = TRUE)
vsearch_settings
deblur_settings <- read.xlsx("combinations_settings.xlsx",sheet = "deblur", rowNames = FALSE)
deblur_settings
The whole analysis was performed on the genus taxonomic level.
Part 1
All mock community samples processed with different pipelines were compared to the reference relative abundance of ZymoBIOMICS Microbial Community DNA Standard.
The average precision and recall were calculated for each pipeline setting to quantify the accurately detected taxa. The Bray-Curtis distance was computed using vegan v2.6.4 and Phyloseq v1.46.0 to compare the relative abundance numbers. Results were plotted using ggplot2 v3.4.4.
The differences between Bray-Curtis distances were tested using Kruskal-Wallis and post-hoc Dunn’s test.
Part 2
To focus only on the most relevant pipeline settings, the two best-performing settings for each pipeline were picked to process the data. These had to meet the criteria of precision higher than 0.98, recall higher than 0.75, and median Bray-Curtis distance lower than 0.4. Furthermore, the default settings of each pipeline were added to this comparison.
Statistical analysis began with filtration, removing the very low abundant (abundance < 0.01% within a sample) and very low prevalent (prevalence < 10% of all samples) genera. Next, to compare these two groups, the Shannon index was calculated and tested using the Mann-Whitney U test. Subsequently,nthe multivariate analysis was performed, testing differences in beta diversity (Aitchison distance) via PERMANOVA.
Finally,univariate analysis was done utilizing Aldex2 v1.34 using 128
Monte-Carlo samples.
To answer the question to what degree pipelines differ from each other,
a simple comparison was made (i) by plotting PCA on clr-transformed data
and (ii) by comparing outputs of taxonomic assignments based on
different pipelines. To highlight the different performances of the
taxonomic classifiers, the VSEARCH results were re-classified in the
same way as the other two pipelines using IDTAXA.
# PACKAGES
suppressWarnings(suppressMessages({
library(reshape2)
library("phyloseq")
library("ggplot2")
library("dplyr")
library(tibble)
library(vegan)
library(compositions)
library(openxlsx)
library(DECIPHER)
library(pheatmap)
library("biomformat")
library(Biostrings)
library(decontam)
library(microDecon)
library(FSA)
library(ggpubr)
library(ggfortify)
library(eulerr)
library(microbiome)
}))
normalization <- function(phyloseq_object){
# normalization of phyloseq object
phyloseq_object_n <- phyloseq_object
asv_norm <- phyloseq_object_n@otu_table
taxa_norm <- phyloseq_object_n@tax_table
sums <- colSums(asv_norm[,])
for (i in 1:length(sums)) {
asv_norm[,i] <- asv_norm[,i]/sums[i]
}
where <- rowSums(asv_norm) !=0
asv_norm <- asv_norm[where,]
taxa_norm <- taxa_norm[where,]
phyloseq_object_n@otu_table <- asv_norm
phyloseq_object_n@tax_table <- taxa_norm
return(phyloseq_object_n)
}
normalization2 <- function(taxa_reads_table){
# normalization of dataframe
taxa_reads_table_n <- taxa_reads_table
asv_norm <- taxa_reads_table_n[,7:ncol(taxa_reads_table_n)]
sums <- colSums(asv_norm[,])
for (i in 1:length(sums)) {
asv_norm[,i] <- asv_norm[,i]/sums[i]
}
where <- rowSums(asv_norm) !=0
asv_norm <- asv_norm[where,]
taxa_reads_table_n <- taxa_reads_table_n[where,]
taxa_reads_table_n[,7:ncol(taxa_reads_table_n)] <- asv_norm
return(taxa_reads_table_n)
}
abundance_filtering <- function(phyloseq_object, to_filter= 0.01){
# abundance filtering of phyloseq object
phyloseq_object_n <- normalization(phyloseq_object)
filtering <- filter_taxa(phyloseq_object_n, function(x) sum(x > to_filter)>0 , FALSE)
phyloseq_object@otu_table <- phyloseq_object@otu_table[filtering,]
phyloseq_object@tax_table <- phyloseq_object@tax_table[filtering,]
return(phyloseq_object)
}
mock_zymo_genus_merging <- function(taxa_table,asv_table,zymo_genus){
# merging mock community samples with reference abundance
taxa_table[is.na(taxa_table)] <- "unassigned"
taxa_reads_table <- merge(taxa_table,asv_table, by="ASV", all=TRUE)
groups <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
columns <- names(taxa_reads_table)[9:ncol(taxa_reads_table)] # names of columns with samples
grouped_genus <- taxa_reads_table %>%
group_by_at(groups[1:6]) %>%
summarise(across(columns, sum))
mock_genus_n <- normalization2(grouped_genus)
to_filter <- (rowSums((mock_genus_n[,7:ncol(mock_genus_n)])>0.01))>0
mock_genus_n <- mock_genus_n[to_filter,]
mock_zymo_genus <- merge(mock_genus_n,zymo_genus, all=TRUE)
mock_zymo_genus[is.na(mock_zymo_genus)] <- 0
return(mock_zymo_genus)
}
genus_merging <- function(taxa_table,asv_table){
# merging taxonomy and filtering
taxa_table[is.na(taxa_table)] <- "unassigned"
taxa_reads_table <- merge(taxa_table,asv_table, by="ASV", all=TRUE)
groups <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
columns <- names(taxa_reads_table)[9:ncol(taxa_reads_table)] # names of columns with samples
grouped_genus <- taxa_reads_table %>%
group_by_at(groups[1:6]) %>%
summarise(across(columns, sum))
asv_table <- grouped_genus[,7:ncol(grouped_genus)]
asv_table["ASV"] <- paste0("GENUS", 1:nrow(asv_table))
taxa_table <- grouped_genus[,1:6]
taxa_table["ASV"] <- paste0("GENUS", 1:nrow(taxa_table))
# Step 1: Calculate abundance of each ASV within each sample
abundance <- asv_table[, -which(colnames(asv_table)=="ASV")] / colSums(asv_table[,-which(colnames(asv_table)=="ASV")]) * 100
# Step 2: Calculate prevalence of each ASV across all samples
prevalence <- rowSums(abundance > 0) / ncol(abundance) * 100
# Step 3: Filter out ASVs that are very low abundant and very low prevalent
threshold_abundance <- 0.01 # Abundance threshold (0.01%)
threshold_prevalence <- 10 # Prevalence threshold (10%)
# Filter ASVs based on abundance and prevalence
to_filter <- rowSums(abundance >= threshold_abundance) > 0 & prevalence >= threshold_prevalence
asv_table <- asv_table[to_filter, ]
taxa_table <- taxa_table[to_filter,]
# Output filtered ASV table
return(list(asv_table,taxa_table))
}
construct_phyloseq <- function(asv_table, taxa_table){
# construct phyloseq object - for mock communities (without metadata)
otu_mat <- asv_table
tax_mat <- taxa_table
rownames(otu_mat) <- 1:nrow(otu_mat)
rownames(tax_mat) <- 1:nrow(tax_mat)
otu_mat <- otu_mat %>%
tibble::column_to_rownames("ASV")
tax_mat <- tax_mat %>%
tibble::column_to_rownames("ASV")
otu_mat <- as.matrix(otu_mat)
tax_mat <- as.matrix(tax_mat)
OTU = otu_table(otu_mat, taxa_are_rows = TRUE)
TAX = tax_table(tax_mat)
object <- phyloseq(OTU, TAX)
return(object)
}
construct_phyloseq_real <- function(asv_table, taxa_table, metadata){
# construct phyloseq obect - for biopsy samples (with metadata)
otu_mat <- asv_table
tax_mat <- taxa_table
samples_df <- metadata
rownames(otu_mat) <- 1:nrow(otu_mat)
rownames(tax_mat) <- 1:nrow(tax_mat)
otu_mat <- otu_mat %>%
tibble::column_to_rownames("ASV")
tax_mat <- tax_mat %>%
tibble::column_to_rownames("ASV")
samples_df <- samples_df %>%
tibble::column_to_rownames("SampleID")
otu_mat <- as.matrix(otu_mat)
tax_mat <- as.matrix(tax_mat)
OTU = otu_table(otu_mat, taxa_are_rows = TRUE)
TAX = tax_table(tax_mat)
samples = sample_data(samples_df)
object <- phyloseq(OTU, TAX, samples)
return(object)
}
read_counts <- function(asv_table,input_numbers){
# function for visualizing the read counts (input vs retained)
counts <- as.data.frame(colSums(asv_table[,- grep("ASV",colnames(asv_table))]))
colnames(counts) <- "Count"
counts["Sample"] <- gsub("-","_",rownames(counts))
rownames(counts) <- counts$Sample
counts <- merge(counts,input_numbers,by='row.names',all=TRUE)
counts %>%
ggplot() +
geom_col( aes(x=reorder(Sample,input), y=input), alpha=.8, width=1, show.legend = TRUE,fill="#a5cee3") +
geom_text(aes(x=reorder(Sample,input), y=input,label = input), size = 3, hjust = -0.2) +
geom_col( aes(x=reorder(Sample,Count), y=Count), alpha=.8, width=0.5, show.legend = TRUE,fill="#e3211c") +
geom_text(aes(x=reorder(Sample,Count), y=Count,label = Count), size = 3, hjust = -0.1,color="#e32000") +
scale_fill_brewer(palette = "Set2") +
coord_flip() +
xlab("") +
labs(y = "Number of reads") +
theme(legend.position='none') +
theme_bw()
}
precision <- function(mock_object_filt, zymo_taxa){
# calculating precision
asv_filt <- as.data.frame(mock_object_filt@otu_table)
asv_filt <- asv_filt[,grep("Mock",colnames(asv_filt))]
asv_filt <- asv_filt[rowSums(asv_filt)>0,]
taxa_filt <- as.data.frame(mock_object_filt@tax_table)
taxa_filt <- taxa_filt[rownames(asv_filt),]
zymo_taxa <- zymo_taxa[nchar(zymo_taxa$Genus)>0,"Genus"]
precs <- c()
for (c in 1:ncol(asv_filt)){
taxa_sample <- taxa_filt[rownames(asv_filt)[(asv_filt[,colnames(asv_filt)[c]]>0)],"Genus"]
taxa_sample <- taxa_sample[!is.na(taxa_sample)]
tp <- sum(taxa_sample %in% zymo_taxa)
fp <- sum(!(taxa_sample %in% zymo_taxa))
prec = tp/(tp + fp)
precs <- c(precs,prec)
}
return(mean(precs))
}
recall <- function(mock_object_filt, zymo_taxa){
# calculating recall
asv_filt <- as.data.frame(mock_object_filt@otu_table)
asv_filt <- asv_filt[,grep("Mock",colnames(asv_filt))]
asv_filt <- asv_filt[rowSums(asv_filt)>0,]
taxa_filt <- as.data.frame(mock_object_filt@tax_table)
taxa_filt <- taxa_filt[rownames(asv_filt),]
zymo_taxa <- zymo_taxa[nchar(zymo_taxa$Genus)>0,"Genus"]
recs <- c()
for (c in 1:ncol(asv_filt)){
taxa_sample <- taxa_filt[rownames(asv_filt)[(asv_filt[,colnames(asv_filt)[c]]>0)],"Genus"]
taxa_sample <- taxa_sample[!is.na(taxa_sample)]
tp <- sum(taxa_sample %in% zymo_taxa)
fn <- sum(!(zymo_taxa %in% taxa_sample))
rec = tp/(tp+fn)
recs <- c(recs,rec)
}
return(mean(recs))
}
READ COUNTS
input_numbers <- read.csv("results/input_reads.csv", sep="\t", row.names = 1)
rownames(input_numbers) <- regmatches(rownames(input_numbers), regexpr("(Mock\\d+_\\d+)|(NC\\d+_\\d+)", rownames(input_numbers)))
ZYMO RESEARCH REFERENCE
zymo_asv_table <- read.csv("zymo/asv_table.csv", sep=",")
zymo_asv_table
zymo_taxa <- read.csv("zymo/taxa.csv", sep=",")
zymo_taxa
zymo_object <- construct_phyloseq(zymo_asv_table,zymo_taxa)
taxa_names(zymo_object) <- paste0("ZYMO",1:8)
plot_bar(zymo_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_minimal() +
xlab("") +
ylab("Relative Abundance") +
scale_x_discrete(labels=c('Zymo Research Reference'),guide = guide_axis(angle = 0)) +
geom_text(aes(label = Abundance), position = position_stack(vjust = .5))
GENUS MERGING
zymo_taxa_reads_table <- merge(zymo_taxa,zymo_asv_table, by="ASV", all=TRUE)
groups <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
columns <- names(zymo_taxa_reads_table)[9:ncol(zymo_taxa_reads_table)] # names of columns with samples
zymo_genus <- zymo_taxa_reads_table %>%
group_by_at(groups[1:6]) %>%
summarise(across(columns, sum))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(columns, sum)`.
## Caused by warning:
## ! Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(columns)
##
## # Now:
## data %>% select(all_of(columns))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
Utilizing DADA2, the paired reads were processed separately. After filtering and trimming with default parameters except for expected errors, the error model was trained for error rate estimation, and ASVs were inferred. The final step involved merging the paired reads, with default settings except for the maximum allowed mismatches in the overlap region. Chimeras were removed using the ‘removeBimeraDenovo’ function and taxonomic composition was obtained using IDTAXA from the DECIPHER R package v2.30.0. Taxonomic classifier was used in combination with SILVA v138.1 database.
dada2_settings
asv_table <- read.csv("results_dada2/1/final_seqtab_1.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/1/taxa_1.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
Phyloseq object, filtering
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting1_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 1"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- data.frame(samples=NA)
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- data.frame()
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_dada2/2/final_seqtab_2.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/2/taxa_2.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting2_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 2"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_dada2/3/final_seqtab_3.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/3/taxa_3.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting3_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 3"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_dada2/4/final_seqtab_4.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/4/taxa_4.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting4_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 4"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_dada2/5/final_seqtab_5.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/5/taxa_5.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting5_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 5"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_dada2/6/final_seqtab_6.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/6/taxa_6.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting6_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 6"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_dada2/7/final_seqtab_7.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/7/taxa_7.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting7_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 7"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_dada2/8/final_seqtab_8.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/8/taxa_8.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting8_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 8"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_dada2/9/final_seqtab_9.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/9/taxa_9.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting9_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 9"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_dada2/10/final_seqtab_10.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/10/taxa_10.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting10_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 10"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_dada2/11/final_seqtab_11.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/11/taxa_11.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting11_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 11"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_dada2/12/final_seqtab_12.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/12/taxa_12.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting12_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 12"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_dada2/13/final_seqtab_13.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/13/taxa_13.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting13_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 13"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_dada2/14/final_seqtab_14.csv")
colnames(asv_table) <- c("ASV",regmatches(colnames(asv_table), regexpr("[[:alpha:]]+\\d+_\\d+", colnames(asv_table))))
asv_table
taxa_table <- read.csv("results_dada2/14/taxa_14.csv", sep=",", row.names = 1)
taxa_table
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting14_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 14"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
On the heatmap, there is the Bray Curtis distance for each sample and the DADA2 settings with which they were processed. The settings are color coded. The smaller the Bray-Curtis distance, the more similar the sample is to the reference. Negative controls are less similar, therefore their distance is larger (parts with red colouring).
annotation_df <- data.frame(maxEE = c("c(2,2)","c(2,2)","c(2,3)","c(2,3)","c(2,4)","c(2,4)","c(2,5)","c(2,5)", "c(1,3)", "c(1,3)", "c(1,4)","c(1,4)","c(1,5)","c(1,5)"),maxMismatch = c(0,1,0,1,0,1,0,1,0,1,0,1,0,1),row.names = paste("SETTING",(1:14)))
bray_curtis_distances$samples = rownames(bray_curtis_distances)
pheatmap(bray_curtis_distances[-which(rownames(bray_curtis_distances)=="Mock.Reference"),-which(names(bray_curtis_distances) == "samples")],cluster_rows = FALSE, cellheight = 6, annotation_col = annotation_df)
First, the one final variable has to be created to store the results for all settings (all_settings_genus_object).
MERGING ALL SETTINGS TOGETHER
colnames(setting1_genus)[7:ncol(setting1_genus)] <- paste("S1", colnames(setting1_genus)[7:(ncol(setting1_genus))])
colnames(setting2_genus)[7:ncol(setting2_genus)] <- paste("S2", colnames(setting2_genus)[7:ncol(setting2_genus)])
colnames(setting3_genus)[7:ncol(setting3_genus)] <- paste("S3", colnames(setting3_genus)[7:ncol(setting3_genus)])
colnames(setting4_genus)[7:ncol(setting4_genus)] <- paste("S4", colnames(setting4_genus)[7:ncol(setting4_genus)])
colnames(setting5_genus)[7:ncol(setting5_genus)] <- paste("S5", colnames(setting5_genus)[7:ncol(setting5_genus)])
colnames(setting6_genus)[7:ncol(setting6_genus)] <- paste("S6", colnames(setting6_genus)[7:ncol(setting6_genus)])
colnames(setting7_genus)[7:ncol(setting7_genus)] <- paste("S7", colnames(setting7_genus)[7:ncol(setting7_genus)])
colnames(setting8_genus)[7:ncol(setting8_genus)] <- paste("S8", colnames(setting8_genus)[7:ncol(setting8_genus)])
colnames(setting9_genus)[7:ncol(setting9_genus)] <- paste("S9", colnames(setting9_genus)[7:ncol(setting9_genus)])
colnames(setting10_genus)[7:ncol(setting10_genus)] <- paste("S10", colnames(setting10_genus)[7:ncol(setting10_genus)])
colnames(setting11_genus)[7:ncol(setting11_genus)] <- paste("S11", colnames(setting11_genus)[7:ncol(setting11_genus)])
colnames(setting12_genus)[7:ncol(setting12_genus)] <- paste("S12", colnames(setting12_genus)[7:ncol(setting12_genus)])
colnames(setting13_genus)[7:ncol(setting13_genus)] <- paste("S13", colnames(setting13_genus)[7:ncol(setting13_genus)])
colnames(setting14_genus)[7:ncol(setting14_genus)] <- paste("S14", colnames(setting14_genus)[7:ncol(setting14_genus)])
all_settings_genus <- merge(setting1_genus,setting2_genus[,-which(colnames(setting2_genus)=="S2 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting3_genus[,-which(colnames(setting3_genus)=="S3 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting4_genus[,-which(colnames(setting4_genus)=="S4 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting5_genus[,-which(colnames(setting5_genus)=="S5 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting6_genus[,-which(colnames(setting6_genus)=="S6 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting7_genus[,-which(colnames(setting7_genus)=="S7 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting8_genus[,-which(colnames(setting8_genus)=="S8 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting9_genus[,-which(colnames(setting9_genus)=="S9 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting10_genus[,-which(colnames(setting10_genus)=="S10 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting11_genus[,-which(colnames(setting11_genus)=="S11 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting12_genus[,-which(colnames(setting12_genus)=="S12 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting13_genus[,-which(colnames(setting13_genus)=="S13 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting14_genus[,-which(colnames(setting14_genus)=="S14 Mock.Reference")], all=TRUE)
all_settings_genus[is.na(all_settings_genus)] <- 0
# phyloseq object
all_settings_genus_asv <- all_settings_genus[,7:ncol(all_settings_genus)]
all_settings_genus_asv["ASV"] <- paste("ASV",1:nrow(all_settings_genus_asv))
all_settings_genus_taxa <- all_settings_genus[,1:6]
all_settings_genus_taxa["ASV"] <- paste("ASV",1:nrow(all_settings_genus_taxa))
all_settings_genus_object <- construct_phyloseq(all_settings_genus_asv, all_settings_genus_taxa)
ORDINATION - PCoA
Here, the Principal Coordinate Analysis (PCoA) is used to visualize the distance matrix. Mock community samples cluster separately from negative controls.
ord_b <- ordinate(all_settings_genus_object, method = "PCoA", distance = "bray")
data_for_pca_bray <- as.data.frame(t(all_settings_genus_object@otu_table))
data_for_pca_bray["setting"] <- substring(rownames(data_for_pca_bray),1,3)
data_for_pca_bray["sample"] <- regmatches(rownames(data_for_pca_bray), regexpr("(Mock)|(NC)|(Mock.Reference)", rownames(data_for_pca_bray)))
data_for_pca_bray[data_for_pca_bray$sample=="Mock.Reference","sample"] <- "ZYMO REFERENCE"
imp_bray <- ord_b$values$Relative_eig
pca_bray <- ord_b$vectors
ggplot(data_for_pca_bray, aes(x=pca_bray[,1],y=pca_bray[,2], col= sample)) +
geom_point(show.legend =TRUE) +
xlab(paste("PCo1 ", "(",round(imp_bray[1]*100,2),"%", ")", sep=""))+
ylab(paste("PCo2 ", "(",round(imp_bray[2]*100,2),"%", ")", sep=""))+
theme_bw() +
labs(title="PCoA on Bray-Curtis")
Next, the mock community samples without the negative control presence are analyzed, to investigate the patterns in taxonomic composition revealed by individual bioinformatics pipeline.
In the PCoA plot, some samples processed with the same setup form a consistent cluster, on the other hand (e.g. S6), other setups caused the samples to be more fragmented on the plot (e.g. S13).
# WITHOUT NC
all_settings_genus_asv_mock <- all_settings_genus_asv[,grep("(Mock)|(ASV)",colnames(all_settings_genus_asv))]
to_filter <- (rowSums((all_settings_genus_asv_mock[,-which(colnames(all_settings_genus_asv_mock)=="ASV")])>0))>0
all_settings_genus_asv_mock <- all_settings_genus_asv_mock[to_filter,]
all_settings_genus_taxa_mock <- all_settings_genus_taxa[to_filter,]
# BRAY CURTIS
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(all_settings_genus_asv_mock[,-which(colnames(all_settings_genus_asv_mock)=="ASV")])), method="bray"))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"S1 Mock.Reference"])
rownames(mock_b_dist) <- samples
colnames(mock_b_dist) <- "bray_curtis"
all_settings_genus_mock_object <- construct_phyloseq(all_settings_genus_asv_mock, all_settings_genus_taxa_mock)
all_settings_genus_mock_object_dada2 <- all_settings_genus_mock_object
# ORDINATION
ord_b <- ordinate(all_settings_genus_mock_object, method = "PCoA", distance = "bray")
data_for_pca_bray <- as.data.frame(t(all_settings_genus_mock_object@otu_table))
data_for_pca_bray["Setting"] <- substring(rownames(data_for_pca_bray),1,3)
data_for_pca_bray["sample"] <- regmatches(rownames(data_for_pca_bray), regexpr("(Mock)|(Mock.Reference)", rownames(data_for_pca_bray)))
data_for_pca_bray[data_for_pca_bray$sample=="Mock.Reference","sample"] <- "ZYMO REFERENCE"
data_for_pca_bray[data_for_pca_bray$sample=="ZYMO REFERENCE","Setting"] <- "ZYMO REFERENCE"
imp_bray <- ord_b$values$Relative_eig
pca_bray <- ord_b$vectors
ggplot(data_for_pca_bray, aes(x=pca_bray[,1],y=pca_bray[,2], col= Setting)) +
geom_point(show.legend =TRUE) +
xlab(paste("PCo1 ", "(",round(imp_bray[1]*100,2),"%", ")", sep=""))+
ylab(paste("PCo2 ", "(",round(imp_bray[2]*100,2),"%", ")", sep=""))+
theme_bw() +
labs(title="PCoA on Bray-Curtis") +
scale_color_manual(values=c(
"#1E75B0", "#ABC3E3", "#FA7D0E","#FAB776",
"#2B9D2B", "#D90368", "#D22627","#FA9593",
"#9165B9", "#FAF33E", "#89544A","#C09991",
"#DF75BE", "#F2B2CE", "#1EFC1E"))
This boxplot shows the distribution of Bray-Curtis distances between individual mock community samples and reference abundances across different DADA2’s settings. Boxplot reveals that DADA2 at different settings is not consistent in the results. Worse results are achieved by settings with maxMismatch=0, which does not allow any mismatches when merging reads in the pipeline.
rownames(mock_b_dist)[which(rownames(mock_b_dist)=="S1 Mock.Reference")] <- "ZYMO REFERENCE"
mock_b_dist["Setting"] <- data_for_pca_bray$Setting
mock_b_dist_dada2 <- mock_b_dist
ggplot(data=mock_b_dist,aes(x=Setting, y=bray_curtis, fill=Setting)) +
geom_boxplot() +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90))
CREATE A DATAFRAME
mock_b_dist_settings <- data.frame(
row.names = unique(regmatches(rownames(mock_b_dist), regexpr("Mock\\d+_\\d+", rownames(mock_b_dist)))),
`S1` = mock_b_dist[grep("S1 ",rownames(mock_b_dist)),"bray_curtis"],
`S2` = mock_b_dist[grep("S2 ",rownames(mock_b_dist)),"bray_curtis"],
`S3` = mock_b_dist[grep("S3 ",rownames(mock_b_dist)),"bray_curtis"],
`S4` = mock_b_dist[grep("S4 ",rownames(mock_b_dist)),"bray_curtis"],
`S5` = mock_b_dist[grep("S5 ",rownames(mock_b_dist)),"bray_curtis"],
`S6` = mock_b_dist[grep("S6 ",rownames(mock_b_dist)),"bray_curtis"],
`S7` = mock_b_dist[grep("S7 ",rownames(mock_b_dist)),"bray_curtis"],
`S8` = mock_b_dist[grep("S8 ",rownames(mock_b_dist)),"bray_curtis"],
`S9` = mock_b_dist[grep("S9 ",rownames(mock_b_dist)),"bray_curtis"],
`S10` = mock_b_dist[grep("S10 ",rownames(mock_b_dist)),"bray_curtis"],
`S11` = mock_b_dist[grep("S11 ",rownames(mock_b_dist)),"bray_curtis"],
`S12` = mock_b_dist[grep("S12 ",rownames(mock_b_dist)),"bray_curtis"],
`S13` = mock_b_dist[grep("S13 ",rownames(mock_b_dist)),"bray_curtis"],
`S14` = mock_b_dist[grep("S14",rownames(mock_b_dist)),"bray_curtis"])
mock_b_dist_settings_dada2 <- mock_b_dist_settings
HEATMAP OF ALL MOCK SAMPLES
The heatmap visualizes the same information as the boxplot, but for each sample separately for a more detailed overview.
In this visualization it can be clearly seen that the mock samples processed with maxMismatch=0 parameter have higher Bray-Curtis distance in general.
annotation_df <- data.frame(maxEE = c("c(2,2)","c(2,2)","c(2,3)","c(2,3)","c(2,4)","c(2,4)","c(2,5)","c(2,5)", "c(1,3)", "c(1,3)", "c(1,4)","c(1,4)","c(1,5)","c(1,5)"),maxMismatch = c(0,1,0,1,0,1,0,1,0,1,0,1,0,1),row.names = paste0("S",(1:14)))
pheatmap(mock_b_dist_settings,cluster_rows = FALSE, cellheight = 10, annotation_col = annotation_df)
AVERAGE BRAY CURTIS
From this heatmap, a clear difference is visible between the average performance for all samples. The best results for Bray-Curtis distance were achieved by settings S4 and S12.
pheatmap(mock_b_dist_settings,cluster_rows = FALSE, cellheight = 10, kmeans_k = 1,annotation_col = annotation_df)
The difference in Bray-Curtis distances is tested by the Kruskal-Wallis test and Dunn’s test.
print("Kruskal Wallis")
## [1] "Kruskal Wallis"
kruskal.test(bray_curtis ~ Setting, data = mock_b_dist)
##
## Kruskal-Wallis rank sum test
##
## data: bray_curtis by Setting
## Kruskal-Wallis chi-squared = 189.02, df = 14, p-value < 2.2e-16
print("Post-hoc")
## [1] "Post-hoc"
dunnTest(bray_curtis ~ Setting,
data=mock_b_dist,
method="bh")
## Warning: Setting was coerced to a factor.
## Dunn (1964) Kruskal-Wallis multiple comparison
## p-values adjusted with the Benjamini-Hochberg method.
## Comparison Z P.unadj P.adj
## 1 S1 - S10 4.15255900 3.287779e-05 7.845837e-05
## 2 S1 - S11 -1.26382230 2.062938e-01 3.610142e-01
## 3 S10 - S11 -5.41638131 6.081733e-08 3.991137e-07
## 4 S1 - S12 5.20546303 1.935135e-07 8.834311e-07
## 5 S10 - S12 1.05290403 2.923850e-01 4.514768e-01
## 6 S11 - S12 6.46928534 9.846750e-11 3.446363e-09
## 7 S1 - S13 -0.83861106 4.016876e-01 5.207061e-01
## 8 S10 - S13 -4.99117006 6.001463e-07 2.333902e-06
## 9 S11 - S13 0.42521124 6.706827e-01 7.824631e-01
## 10 S12 - S13 -6.04407409 1.502704e-09 2.629733e-08
## 11 S1 - S14 4.28923404 1.792904e-05 4.378021e-05
## 12 S10 - S14 0.13667504 8.912877e-01 9.085942e-01
## 13 S11 - S14 5.55305635 2.807175e-08 2.456278e-07
## 14 S12 - S14 -0.91622899 3.595468e-01 4.902911e-01
## 15 S13 - S14 5.12784511 2.930774e-07 1.230925e-06
## 16 S1 - S2 4.31454424 1.599325e-05 3.998313e-05
## 17 S10 - S2 0.16198524 8.713175e-01 9.058251e-01
## 18 S11 - S2 5.57836654 2.427876e-08 2.317518e-07
## 19 S12 - S2 -0.89091879 3.729727e-01 5.020787e-01
## 20 S13 - S2 5.15315530 2.561397e-07 1.120611e-06
## 21 S14 - S2 0.02531019 9.798075e-01 9.892288e-01
## 22 S1 - S3 -0.51126590 6.091649e-01 7.186777e-01
## 23 S10 - S3 -4.66382490 3.103855e-06 1.018452e-05
## 24 S11 - S3 0.75255641 4.517165e-01 5.784175e-01
## 25 S12 - S3 -5.71672893 1.085941e-08 1.266932e-07
## 26 S13 - S3 0.32734516 7.434068e-01 8.304013e-01
## 27 S14 - S3 -4.80049994 1.582700e-06 5.730466e-06
## 28 S2 - S3 -4.82581014 1.394353e-06 5.228824e-06
## 29 S1 - S4 5.37588500 7.620742e-08 4.211463e-07
## 30 S10 - S4 1.22332600 2.212066e-01 3.807654e-01
## 31 S11 - S4 6.63970730 3.143067e-11 3.300220e-09
## 32 S12 - S4 0.17042197 8.646783e-01 9.170830e-01
## 33 S13 - S4 6.21449606 5.148963e-10 1.081282e-08
## 34 S14 - S4 1.08665095 2.771911e-01 4.409858e-01
## 35 S2 - S4 1.06134076 2.885351e-01 4.521818e-01
## 36 S3 - S4 5.88715090 3.929100e-09 5.893649e-08
## 37 S1 - S5 -0.16367258 8.699889e-01 9.134883e-01
## 38 S10 - S5 -4.31623158 1.587155e-05 4.064664e-05
## 39 S11 - S5 1.10014972 2.712669e-01 4.450472e-01
## 40 S12 - S5 -5.36913561 7.911491e-08 4.153533e-07
## 41 S13 - S5 0.67493848 4.997149e-01 6.101170e-01
## 42 S14 - S5 -4.45290663 8.471560e-06 2.340826e-05
## 43 S2 - S5 -4.47821682 7.526913e-06 2.195350e-05
## 44 S3 - S5 0.34759332 7.281456e-01 8.310358e-01
## 45 S4 - S5 -5.53955758 3.032367e-08 2.274276e-07
## 46 S1 - S6 3.30213652 9.595136e-04 2.056101e-03
## 47 S10 - S6 -0.85042249 3.950902e-01 5.185559e-01
## 48 S11 - S6 4.56595882 4.972164e-06 1.582052e-05
## 49 S12 - S6 -1.90332652 5.699795e-02 1.031859e-01
## 50 S13 - S6 4.14074758 3.461757e-05 8.077434e-05
## 51 S14 - S6 -0.98709753 3.235948e-01 4.654446e-01
## 52 S2 - S6 -1.01240772 3.113431e-01 4.670147e-01
## 53 S3 - S6 3.81340242 1.370667e-04 3.128696e-04
## 54 S4 - S6 -2.07374848 3.810268e-02 7.018914e-02
## 55 S5 - S6 3.46580910 5.286385e-04 1.156397e-03
## 56 S1 - S7 -1.12545992 2.603942e-01 4.339903e-01
## 57 S10 - S7 -5.27801892 1.305880e-07 6.232608e-07
## 58 S11 - S7 0.13836239 8.899540e-01 9.161291e-01
## 59 S12 - S7 -6.33092295 2.436990e-10 6.397099e-09
## 60 S13 - S7 -0.28684885 7.742281e-01 8.557257e-01
## 61 S14 - S7 -5.41469396 6.139362e-08 3.791959e-07
## 62 S2 - S7 -5.44000415 5.327933e-08 3.729553e-07
## 63 S3 - S7 -0.61419402 5.390871e-01 6.432290e-01
## 64 S4 - S7 -6.50134491 7.960508e-11 4.179267e-09
## 65 S5 - S7 -0.96178733 3.361564e-01 4.769787e-01
## 66 S6 - S7 -4.42759643 9.528897e-06 2.565472e-05
## 67 S1 - S8 4.50858905 6.526018e-06 2.015388e-05
## 68 S10 - S8 0.35603005 7.218181e-01 8.328670e-01
## 69 S11 - S8 5.77241136 7.814506e-09 1.025654e-07
## 70 S12 - S8 -0.69687398 4.858817e-01 6.073521e-01
## 71 S13 - S8 5.34720011 8.932524e-08 4.466262e-07
## 72 S14 - S8 0.21935501 8.263735e-01 9.038460e-01
## 73 S2 - S8 0.19404481 8.461408e-01 9.159256e-01
## 74 S3 - S8 5.01985495 5.171051e-07 2.088309e-06
## 75 S4 - S8 -0.86729595 3.857799e-01 5.127454e-01
## 76 S5 - S8 4.67226163 2.979012e-06 1.009020e-05
## 77 S6 - S8 1.20645253 2.276431e-01 3.855245e-01
## 78 S7 - S8 5.63404897 1.760270e-08 1.848284e-07
## 79 S1 - S9 -0.17379666 8.620253e-01 9.235985e-01
## 80 S10 - S9 -4.32635566 1.515966e-05 3.979410e-05
## 81 S11 - S9 1.09002565 2.757018e-01 4.453645e-01
## 82 S12 - S9 -5.37925969 7.479276e-08 4.362911e-07
## 83 S13 - S9 0.66481440 5.061692e-01 6.108939e-01
## 84 S14 - S9 -4.46303070 8.080847e-06 2.293213e-05
## 85 S2 - S9 -4.48834090 7.178000e-06 2.153400e-05
## 86 S3 - S9 0.33746924 7.357632e-01 8.307004e-01
## 87 S4 - S9 -5.54968166 2.861902e-08 2.311537e-07
## 88 S5 - S9 -0.01012408 9.919223e-01 9.919223e-01
## 89 S6 - S9 -3.47593317 5.090791e-04 1.137304e-03
## 90 S7 - S9 0.95166326 3.412678e-01 4.777749e-01
## 91 S8 - S9 -4.68238571 2.835553e-06 9.924435e-06
## 92 S1 - ZYMO REFERENCE 2.26146133 2.373070e-02 4.449507e-02
## 93 S10 - ZYMO REFERENCE 1.03693848 2.997645e-01 4.561634e-01
## 94 S11 - ZYMO REFERENCE 2.63414219 8.435015e-03 1.771353e-02
## 95 S12 - ZYMO REFERENCE 0.72645402 4.675605e-01 5.914922e-01
## 96 S13 - ZYMO REFERENCE 2.50875424 1.211577e-02 2.446454e-02
## 97 S14 - ZYMO REFERENCE 0.99663521 3.189416e-01 4.716742e-01
## 98 S2 - ZYMO REFERENCE 0.98917164 3.225792e-01 4.704280e-01
## 99 S3 - ZYMO REFERENCE 2.41222542 1.585548e-02 3.141179e-02
## 100 S4 - ZYMO REFERENCE 0.67619933 4.989141e-01 6.163057e-01
## 101 S5 - ZYMO REFERENCE 2.30972574 2.090334e-02 3.990638e-02
## 102 S6 - ZYMO REFERENCE 1.28771439 1.978454e-01 3.520977e-01
## 103 S7 - ZYMO REFERENCE 2.59334135 9.504838e-03 1.956878e-02
## 104 S8 - ZYMO REFERENCE 0.93195095 3.513619e-01 4.854342e-01
## 105 S9 - ZYMO REFERENCE 2.31271117 2.073852e-02 4.032491e-02
VSEARCH-UNOISE3 and Deblur operate with merged reads, unlike DADA2. Therefore, the initial common step for these two pipelines was merging the paired reads. This was conducted using the ‘vsearch –fastq mergepairs’ command, with different values of minimum overlap region length and maximum allowed mismatches. the subsequent steps differed for each pipeline.
In VSEARCH, the reads underwent reorientation and filtering, with various settings for expected error and truncation length. These filtered reads were then denoised using UNOISE3, which is implemented in VSEARCH. Chimeras were discarded using ‘uchime3 denovo’. Finally, taxonomic assignment was performed using SINTAX.
Since the output of the SINTAX taxonomic classifier is also a probability matrix of classification, those with a probability of less than 0.7 were reassigned as ‘unclassified’, while others remained unchanged.
vsearch_settings
asv_table <- read.csv("results_vsearch/1/all.otutab.csv", sep="\t")
colnames(asv_table) <- c("ASV",colnames(asv_table)[-1])
asv_table
taxa_table_vsearch <- read.csv("results_vsearch/1/sintax.csv", sep=",", header = FALSE)
colnames(taxa_table_vsearch) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
taxa_table_vsearch
SPLITING PROBABILITES AND TAXA TABLE
taxa_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
taxa_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr(":.+[(]", taxa_table_vsearch[,i]))
taxa_table[,i] <- substring(taxa_table[,i],2,nchar(taxa_table[,i])-1)
}
colnames(taxa_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
prob_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
prob_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr("[(].+[)]", taxa_table_vsearch[,i]))
prob_table[,i] <- substring(prob_table[,i],2,nchar(prob_table[,i])-1)
}
colnames(prob_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
# remove assignment with less than 0.7 probability
taxa_table[prob_table[,]<0.7] <- NA
taxa_table[taxa_table=="Limosilactobacillus"] <- "Lactobacillus"
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting1_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 1"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- data.frame(samples=NA)
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_vsearch/2/all.otutab.csv", sep="\t")
colnames(asv_table) <- c("ASV",colnames(asv_table)[-1])
asv_table
taxa_table_vsearch <- read.csv("results_vsearch/2/sintax.csv", sep=",", header = FALSE)
colnames(taxa_table_vsearch) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
taxa_table_vsearch
SPLITING PROBABILITES AND TAXA TABLE
taxa_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
taxa_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr(":.+[(]", taxa_table_vsearch[,i]))
taxa_table[,i] <- substring(taxa_table[,i],2,nchar(taxa_table[,i])-1)
}
colnames(taxa_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
prob_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
prob_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr("[(].+[)]", taxa_table_vsearch[,i]))
prob_table[,i] <- substring(prob_table[,i],2,nchar(prob_table[,i])-1)
}
colnames(prob_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
# remove assignment with less than 0.7 probability
taxa_table[prob_table[,]<0.7] <- NA
taxa_table[taxa_table=="Limosilactobacillus"] <- "Lactobacillus"
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting2_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 2"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_vsearch/3/all.otutab.csv", sep="\t")
colnames(asv_table) <- c("ASV",colnames(asv_table)[-1])
asv_table
taxa_table_vsearch <- read.csv("results_vsearch/3/sintax.csv", sep=",", header = FALSE)
colnames(taxa_table_vsearch) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
taxa_table_vsearch
SPLITING PROBABILITES AND TAXA TABLE
taxa_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
taxa_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr(":.+[(]", taxa_table_vsearch[,i]))
taxa_table[,i] <- substring(taxa_table[,i],2,nchar(taxa_table[,i])-1)
}
colnames(taxa_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
prob_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
prob_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr("[(].+[)]", taxa_table_vsearch[,i]))
prob_table[,i] <- substring(prob_table[,i],2,nchar(prob_table[,i])-1)
}
colnames(prob_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
# remove assignment with less than 0.7 probability
taxa_table[prob_table[,]<0.7] <- NA
taxa_table[taxa_table=="Limosilactobacillus"] <- "Lactobacillus"
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting3_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 3"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_vsearch/4/all.otutab.csv", sep="\t")
colnames(asv_table) <- c("ASV",colnames(asv_table)[-1])
asv_table
taxa_table_vsearch <- read.csv("results_vsearch/4/sintax.csv", sep=",", header = FALSE)
colnames(taxa_table_vsearch) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
taxa_table_vsearch
SPLITING PROBABILITES AND TAXA TABLE
taxa_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
taxa_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr(":.+[(]", taxa_table_vsearch[,i]))
taxa_table[,i] <- substring(taxa_table[,i],2,nchar(taxa_table[,i])-1)
}
colnames(taxa_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
prob_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
prob_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr("[(].+[)]", taxa_table_vsearch[,i]))
prob_table[,i] <- substring(prob_table[,i],2,nchar(prob_table[,i])-1)
}
colnames(prob_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
# remove assignment with less than 0.7 probability
taxa_table[prob_table[,]<0.7] <- NA
taxa_table[taxa_table=="Limosilactobacillus"] <- "Lactobacillus"
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting4_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 4"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_vsearch/5/all.otutab.csv", sep="\t")
colnames(asv_table) <- c("ASV",colnames(asv_table)[-1])
asv_table
taxa_table_vsearch <- read.csv("results_vsearch/5/sintax.csv", sep=",", header = FALSE)
colnames(taxa_table_vsearch) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
taxa_table_vsearch
SPLITING PROBABILITES AND TAXA TABLE
taxa_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
taxa_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr(":.+[(]", taxa_table_vsearch[,i]))
taxa_table[,i] <- substring(taxa_table[,i],2,nchar(taxa_table[,i])-1)
}
colnames(taxa_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
prob_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
prob_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr("[(].+[)]", taxa_table_vsearch[,i]))
prob_table[,i] <- substring(prob_table[,i],2,nchar(prob_table[,i])-1)
}
colnames(prob_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
# remove assignment with less than 0.7 probability
taxa_table[prob_table[,]<0.7] <- NA
taxa_table[taxa_table=="Limosilactobacillus"] <- "Lactobacillus"
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting5_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 5"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_vsearch/6/all.otutab.csv", sep="\t")
colnames(asv_table) <- c("ASV",colnames(asv_table)[-1])
asv_table
taxa_table_vsearch <- read.csv("results_vsearch/6/sintax.csv", sep=",", header = FALSE)
colnames(taxa_table_vsearch) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
taxa_table_vsearch
SPLITING PROBABILITES AND TAXA TABLE
taxa_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
taxa_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr(":.+[(]", taxa_table_vsearch[,i]))
taxa_table[,i] <- substring(taxa_table[,i],2,nchar(taxa_table[,i])-1)
}
colnames(taxa_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
prob_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
prob_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr("[(].+[)]", taxa_table_vsearch[,i]))
prob_table[,i] <- substring(prob_table[,i],2,nchar(prob_table[,i])-1)
}
colnames(prob_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
# remove assignment with less than 0.7 probability
taxa_table[prob_table[,]<0.7] <- NA
taxa_table[taxa_table=="Limosilactobacillus"] <- "Lactobacillus"
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting6_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 6"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_vsearch/7/all.otutab.csv", sep="\t")
colnames(asv_table) <- c("ASV",colnames(asv_table)[-1])
asv_table
taxa_table_vsearch <- read.csv("results_vsearch/7/sintax.csv", sep=",", header = FALSE)
colnames(taxa_table_vsearch) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
taxa_table_vsearch
taxa_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
taxa_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr(":.+[(]", taxa_table_vsearch[,i]))
taxa_table[,i] <- substring(taxa_table[,i],2,nchar(taxa_table[,i])-1)
}
colnames(taxa_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
prob_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
prob_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr("[(].+[)]", taxa_table_vsearch[,i]))
prob_table[,i] <- substring(prob_table[,i],2,nchar(prob_table[,i])-1)
}
colnames(prob_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
# remove assignment with less than 0.7 probability
taxa_table[prob_table[,]<0.7] <- NA
taxa_table[taxa_table=="Limosilactobacillus"] <- "Lactobacillus"
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting7_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 7"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_vsearch/8/all.otutab.csv", sep="\t")
colnames(asv_table) <- c("ASV",colnames(asv_table)[-1])
asv_table
taxa_table_vsearch <- read.csv("results_vsearch/8/sintax.csv", sep=",", header = FALSE)
colnames(taxa_table_vsearch) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
taxa_table_vsearch
SPLITING PROBABILITES AND TAXA TABLE
taxa_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
taxa_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr(":.+[(]", taxa_table_vsearch[,i]))
taxa_table[,i] <- substring(taxa_table[,i],2,nchar(taxa_table[,i])-1)
}
colnames(taxa_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
prob_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
prob_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr("[(].+[)]", taxa_table_vsearch[,i]))
prob_table[,i] <- substring(prob_table[,i],2,nchar(prob_table[,i])-1)
}
colnames(prob_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
# remove assignment with less than 0.7 probability
taxa_table[prob_table[,]<0.7] <- NA
taxa_table[taxa_table=="Limosilactobacillus"] <- "Lactobacillus"
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting8_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 8"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
asv_table <- read.csv("results_vsearch/9/all.otutab.csv", sep="\t")
colnames(asv_table) <- c("ASV",colnames(asv_table)[-1])
asv_table
taxa_table_vsearch <- read.csv("results_vsearch/9/sintax.csv", sep=",", header = FALSE)
colnames(taxa_table_vsearch) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
taxa_table_vsearch
taxa_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
taxa_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr(":.+[(]", taxa_table_vsearch[,i]))
taxa_table[,i] <- substring(taxa_table[,i],2,nchar(taxa_table[,i])-1)
}
colnames(taxa_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
prob_table <- data.frame(ASV = taxa_table_vsearch$ASV)
for (i in 2:8){
prob_table[,i] <- regmatches(taxa_table_vsearch[,i], regexpr("[(].+[)]", taxa_table_vsearch[,i]))
prob_table[,i] <- substring(prob_table[,i],2,nchar(prob_table[,i])-1)
}
colnames(prob_table) <- c("ASV","Domain","Phylum","Class","Order", "Family","Genus","Species")
# remove assignment with less than 0.7 probability
taxa_table[prob_table[,]<0.7] <- NA
taxa_table[taxa_table=="Limosilactobacillus"] <- "Lactobacillus"
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting9_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"Mock.Reference"])
colnames(mock_b_dist) <- "SETTING 9"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
On the heatmap, there is the Bray-Curtis distance for each sample and the VSEARCH settings with which they were processed. The settings are color coded. The smaller the Bray-Curtis distance, the more similar the sample is to the reference. Negative controls are less similar, therefore their distance is larger (parts with red colouring).
Even from this perspective it can be seen that unlike DADA2, VSEARCH is consistent in its results across different settings and there is no mock community sample that has a significantly higher Bray-Curtis distance (mock samples in blue colors).
annotation_df <- data.frame(max_diffs = c(2,2,3,3,2,2,2,2,2),fastq_maxee = c(1,1,1,2,2,2,1,1,2),
fastq_trunclen = c(250,350,350,350,350,400,400,400,400),fastq_minovlen=c(10,12,12,12,12,12,12,10,10),
row.names = paste("SETTING",(1:9)))
bray_curtis_distances$samples = rownames(bray_curtis_distances)
pheatmap(bray_curtis_distances[-41,-which(names(bray_curtis_distances) == "samples")],cluster_rows = FALSE, cellheight = 6, annotation_col = annotation_df, )
First, the one final variable has to be created to store the results for all settings (all_settings_genus_object).
colnames(setting1_genus)[7:ncol(setting1_genus)] <- paste("S1", colnames(setting1_genus)[7:(ncol(setting1_genus))])
colnames(setting2_genus)[7:ncol(setting2_genus)] <- paste("S2", colnames(setting2_genus)[7:ncol(setting2_genus)])
colnames(setting3_genus)[7:ncol(setting3_genus)] <- paste("S3", colnames(setting3_genus)[7:ncol(setting3_genus)])
colnames(setting4_genus)[7:ncol(setting4_genus)] <- paste("S4", colnames(setting4_genus)[7:ncol(setting4_genus)])
colnames(setting5_genus)[7:ncol(setting5_genus)] <- paste("S5", colnames(setting5_genus)[7:ncol(setting5_genus)])
colnames(setting6_genus)[7:ncol(setting6_genus)] <- paste("S6", colnames(setting6_genus)[7:ncol(setting6_genus)])
colnames(setting7_genus)[7:ncol(setting7_genus)] <- paste("S7", colnames(setting7_genus)[7:ncol(setting7_genus)])
colnames(setting8_genus)[7:ncol(setting8_genus)] <- paste("S8", colnames(setting8_genus)[7:ncol(setting8_genus)])
colnames(setting9_genus)[7:ncol(setting9_genus)] <- paste("S9", colnames(setting9_genus)[7:ncol(setting9_genus)])
all_settings_genus <- merge(setting1_genus,setting2_genus[,-which(colnames(setting2_genus)=="S2 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting3_genus[,-which(colnames(setting3_genus)=="S3 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting4_genus[,-which(colnames(setting4_genus)=="S4 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting5_genus[,-which(colnames(setting5_genus)=="S5 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting6_genus[,-which(colnames(setting6_genus)=="S6 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting7_genus[,-which(colnames(setting7_genus)=="S7 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting8_genus[,-which(colnames(setting8_genus)=="S8 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting9_genus[,-which(colnames(setting9_genus)=="S9 Mock.Reference")], all=TRUE)
all_settings_genus[is.na(all_settings_genus)] <- 0
#phyloseq object
all_settings_genus_asv <- all_settings_genus[,7:ncol(all_settings_genus)]
all_settings_genus_asv["ASV"] <- paste("ASV",1:nrow(all_settings_genus_asv))
all_settings_genus_taxa <- all_settings_genus[,1:6]
all_settings_genus_taxa["ASV"] <- paste("ASV",1:nrow(all_settings_genus_taxa))
all_settings_genus_object <- construct_phyloseq(all_settings_genus_asv, all_settings_genus_taxa)
ORDINATION - PCoA
Here, the Principal Coordinate Analysis (PCoA) is used to visualize the distance matrix. Mock community samples cluster separately from negative controls.
Compared to PCoA from the DADA2 pipeline, it can be seen that the mock community samples are more similar to each other.
ord_b <- ordinate(all_settings_genus_object, method = "PCoA", distance = "bray")
data_for_pca_bray <- as.data.frame(t(all_settings_genus_object@otu_table))
data_for_pca_bray["setting"] <- substring(rownames(data_for_pca_bray),1,3)
data_for_pca_bray["sample"] <- regmatches(rownames(data_for_pca_bray), regexpr("(Mock)|(NC)|(Mock.Reference)", rownames(data_for_pca_bray)))
data_for_pca_bray[data_for_pca_bray$sample=="Mock.Reference","sample"] <- "ZYMO REFERENCE"
imp_bray <- ord_b$values$Relative_eig
pca_bray <- ord_b$vectors
ggplot(data_for_pca_bray, aes(x=pca_bray[,1],y=pca_bray[,2], col= sample)) +
geom_point(show.legend =TRUE) +
xlab(paste("PCo1 ", "(",round(imp_bray[1]*100,2),"%", ")", sep=""))+
ylab(paste("PCo2 ", "(",round(imp_bray[2]*100,2),"%", ")", sep=""))+
theme_bw() +
labs(title="PCoA on Bray-Curtis")
Next, the mock community samples without the negative control presence are analyzed, to investigate the patterns in taxonomic composition revealed by individual bioinformatics pipeline.
An interesting finding is that while all settings are nearly identical, S1 differs, with some samples being considerably close to the reference.
One would expect that the reverse might be the case, as setting S1 involves trimming to 250 bp, which significantly reduces the resolution for the denoising algorithm and the taxonomic classifier.
# WITHOUT NC
all_settings_genus_asv_mock <- all_settings_genus_asv[,grep("(Mock)|(ASV)",colnames(all_settings_genus_asv))]
to_filter <- (rowSums((all_settings_genus_asv_mock[,-which(colnames(all_settings_genus_asv_mock)=="ASV")])>0))>0
all_settings_genus_asv_mock <- all_settings_genus_asv_mock[to_filter,]
all_settings_genus_taxa_mock <- all_settings_genus_taxa[to_filter,]
# BRAY CURTIS
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(all_settings_genus_asv_mock[,-which(colnames(all_settings_genus_asv_mock)=="ASV")])), method="bray"))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"S1 Mock.Reference"])
rownames(mock_b_dist) <- samples
colnames(mock_b_dist) <- "bray_curtis"
all_settings_genus_mock_object <- construct_phyloseq(all_settings_genus_asv_mock, all_settings_genus_taxa_mock)
all_settings_genus_mock_object_vsearch <- all_settings_genus_mock_object
# ORDINATION
ord_b <- ordinate(all_settings_genus_mock_object, method = "PCoA", distance = "bray")
data_for_pca_bray <- as.data.frame(t(all_settings_genus_mock_object@otu_table))
data_for_pca_bray["Setting"] <- substring(rownames(data_for_pca_bray),1,3)
data_for_pca_bray["sample"] <- regmatches(rownames(data_for_pca_bray), regexpr("(Mock)|(Mock.Reference)", rownames(data_for_pca_bray)))
data_for_pca_bray[data_for_pca_bray$sample=="Mock.Reference","sample"] <- "ZYMO REFERENCE"
data_for_pca_bray[data_for_pca_bray$sample=="ZYMO REFERENCE","Setting"] <- "ZYMO REFERENCE"
imp_bray <- ord_b$values$Relative_eig
pca_bray <- ord_b$vectors
ggplot(data_for_pca_bray, aes(x=pca_bray[,1],y=pca_bray[,2], col= Setting)) +
geom_point(show.legend =TRUE) +
xlab(paste("PCo1 ", "(",round(imp_bray[1]*100,2),"%", ")", sep=""))+
ylab(paste("PCo2 ", "(",round(imp_bray[2]*100,2),"%", ")", sep=""))+
theme_bw() +
labs(title="PCoA on Bray-Curtis") +
scale_color_manual(values=c(
"#1E75B0", "#ABC3E3", "#FA7D0E","#FAB776",
"#2B9D2B", "#D90368", "#D22627","#FA9593",
"#9165B9", "#1EFC1E"))
This boxplot shows the distribution of Bray-Curtis distances between individual mock community samples and reference abundances across different VSEARCH’s settings.
This confirms the observation from the PCoA plot, where the S1 setup shows an overall smaller Bray-Curtis distance than the other setups.
rownames(mock_b_dist)[23] <- "ZYMO REFERENCE"
mock_b_dist["Setting"] <- data_for_pca_bray$Setting
mock_b_dist_vsearch <- mock_b_dist
ggplot(data=mock_b_dist,aes(x=Setting, y=bray_curtis, fill=Setting)) +
geom_boxplot() +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90))
The heatmap visualizes the same information as the boxplot, but for each sample separately for a more detailed overview.
mock_b_dist_settings <- data.frame(
row.names = unique(regmatches(rownames(mock_b_dist), regexpr("Mock\\d+_\\d+", rownames(mock_b_dist)))),
`S1` = mock_b_dist[grep("S1 ",rownames(mock_b_dist)),"bray_curtis"],
`S2` = mock_b_dist[grep("S2 ",rownames(mock_b_dist)),"bray_curtis"],
`S3` = mock_b_dist[grep("S3 ",rownames(mock_b_dist)),"bray_curtis"],
`S4` = mock_b_dist[grep("S4 ",rownames(mock_b_dist)),"bray_curtis"],
`S5` = mock_b_dist[grep("S5 ",rownames(mock_b_dist)),"bray_curtis"],
`S6` = mock_b_dist[grep("S6 ",rownames(mock_b_dist)),"bray_curtis"],
`S7` = mock_b_dist[grep("S7 ",rownames(mock_b_dist)),"bray_curtis"],
`S8` = mock_b_dist[grep("S8 ",rownames(mock_b_dist)),"bray_curtis"],
`S9` = mock_b_dist[grep("S9 ",rownames(mock_b_dist)),"bray_curtis"])
mock_b_dist_settings_vsearch <- mock_b_dist_settings
HEATMAP OF ALL MOCK SAMPLES
annotation_df <- data.frame(max_diffs = c(2,2,3,3,2,2,2,2,2),fastq_maxee = c(1,1,1,2,2,2,1,1,2),
fastq_trunclen = c(250,350,350,350,350,400,400,400,400),fastq_minovlen=c(10,12,12,12,12,12,12,10,10),
row.names = paste0("S",(1:9)))
pheatmap(mock_b_dist_settings,cluster_rows = FALSE, cellheight = 10, annotation_col = annotation_df)
AVERAGE BRAY CURTIS
It is also clearly shown here that the S1 setup shows an overall the best performance than the other setups.
pheatmap(mock_b_dist_settings,cluster_rows = FALSE, cellheight = 10, kmeans_k = 1,annotation_col = annotation_df)
The difference in Bray-Curtis distances is tested by the Kruskal-Wallis test and Dunn’s test.
print("Kruskal Wallis")
## [1] "Kruskal Wallis"
kruskal.test(bray_curtis ~ Setting, data = mock_b_dist)
##
## Kruskal-Wallis rank sum test
##
## data: bray_curtis by Setting
## Kruskal-Wallis chi-squared = 31.512, df = 9, p-value = 0.0002418
print("Post-hoc")
## [1] "Post-hoc"
dunnTest(bray_curtis ~ Setting,
data=mock_b_dist,
method="bh")
## Warning: Setting was coerced to a factor.
## Dunn (1964) Kruskal-Wallis multiple comparison
## p-values adjusted with the Benjamini-Hochberg method.
## Comparison Z P.unadj P.adj
## 1 S1 - S2 -4.02343154 5.735626e-05 0.002581032
## 2 S1 - S3 -3.92134057 8.805770e-05 0.001320866
## 3 S2 - S3 0.10209098 9.186845e-01 0.961413974
## 4 S1 - S4 -1.96328800 4.961272e-02 0.159469454
## 5 S2 - S4 2.06014354 3.938482e-02 0.196924102
## 6 S3 - S4 1.95805257 5.022384e-02 0.150671535
## 7 S1 - S5 -2.09417387 3.624449e-02 0.233000324
## 8 S2 - S5 1.92925767 5.369888e-02 0.151028104
## 9 S3 - S5 1.82716670 6.767471e-02 0.169186773
## 10 S4 - S5 -0.13088587 8.958656e-01 0.959855995
## 11 S1 - S6 -2.93707885 3.313199e-03 0.029818787
## 12 S2 - S6 1.08635269 2.773230e-01 0.430328766
## 13 S3 - S6 0.98426172 3.249868e-01 0.471755101
## 14 S4 - S6 -0.97379085 3.301604e-01 0.464288060
## 15 S5 - S6 -0.84290498 3.992816e-01 0.528460924
## 16 S1 - S7 -3.97631263 6.999210e-05 0.001574822
## 17 S2 - S7 0.04711891 9.624185e-01 0.962418454
## 18 S3 - S7 -0.05497206 9.561607e-01 0.977891645
## 19 S4 - S7 -2.01302463 4.411204e-02 0.165420144
## 20 S5 - S7 -1.88213876 5.981718e-02 0.158339588
## 21 S6 - S7 -1.03923378 2.986960e-01 0.448044035
## 22 S1 - S8 -3.69621687 2.188361e-04 0.002461906
## 23 S2 - S8 0.32721467 7.435055e-01 0.904263481
## 24 S3 - S8 0.22512369 8.218831e-01 0.924618433
## 25 S4 - S8 -1.73292887 8.310831e-02 0.178089239
## 26 S5 - S8 -1.60204301 1.091461e-01 0.204648936
## 27 S6 - S8 -0.75913803 4.477700e-01 0.575704278
## 28 S7 - S8 0.28009575 7.794040e-01 0.899312356
## 29 S1 - S9 -2.26694321 2.339370e-02 0.175452713
## 30 S2 - S9 1.75648833 7.900506e-02 0.187117253
## 31 S3 - S9 1.65439735 9.804680e-02 0.191830703
## 32 S4 - S9 -0.30365521 7.613906e-01 0.901646745
## 33 S5 - S9 -0.17276934 8.628327e-01 0.947011538
## 34 S6 - S9 0.67013564 5.027713e-01 0.628464161
## 35 S7 - S9 1.70936942 8.738254e-02 0.178737018
## 36 S8 - S9 1.42927366 1.529256e-01 0.245773273
## 37 S1 - ZYMO REFERENCE 0.87458845 3.817979e-01 0.520633464
## 38 S2 - ZYMO REFERENCE 2.06103369 3.929983e-02 0.221061524
## 39 S3 - ZYMO REFERENCE 2.03092870 4.226223e-02 0.172890933
## 40 S4 - ZYMO REFERENCE 1.45353050 1.460765e-01 0.243460864
## 41 S5 - ZYMO REFERENCE 1.49212663 1.356660e-01 0.234806453
## 42 S6 - ZYMO REFERENCE 1.74068575 8.173868e-02 0.183912023
## 43 S7 - ZYMO REFERENCE 2.04713908 4.064443e-02 0.182899940
## 44 S8 - ZYMO REFERENCE 1.96454334 4.946711e-02 0.171232321
## 45 S9 - ZYMO REFERENCE 1.54307353 1.228129e-01 0.221063283
Following the merging of reads, as explained in the previous section, the QIIME2-Deblur plugin was utilized, with the different sequence trim lengths, consistent with those used in VSEARCH. The taxonomic composition was obtained using IDTAXA, similar to the approach employed with DADA2. Taxonomic classifier was used in combination with SILVA v138.1 database.
deblur_settings
my_biom = read_biom("results_deblur/1/feature-table.biom")
## Warning in strsplit(conditionMessage(e), "\n"): unable to translate 'lexical error: invalid char in json text.
## <89>HDF (right here) ------^
## ' to a wide string
## Warning in strsplit(conditionMessage(e), "\n"): input string 1 is invalid
my_biom
## biom object.
## type: OTU table
## matrix_type: dense
## 355 rows and 40 columns
asv_table <- as.data.frame(as(biom_data(my_biom), "matrix"))
colnames(asv_table) <- gsub("-","_",colnames(asv_table))
asv_table
taxa_table <- read.csv("results_deblur/1/taxa_1.csv", sep="\t", row.names = 1)
taxa_table
asv_table["ASV"] <- taxa_table["ASV"]
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting1_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,41])
colnames(mock_b_dist) <- "SETTING 1"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- data.frame(samples=NA)
bray_curtis_distances <- cbind(mock_b_dist,bray_curtis_distances)
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
my_biom = read_biom("results_deblur/2/feature-table.biom")
## Warning in strsplit(conditionMessage(e), "\n"): unable to translate 'lexical error: invalid char in json text.
## <89>HDF (right here) ------^
## ' to a wide string
## Warning in strsplit(conditionMessage(e), "\n"): input string 1 is invalid
my_biom
## biom object.
## type: OTU table
## matrix_type: dense
## 378 rows and 40 columns
asv_table <- as.data.frame(as(biom_data(my_biom), "matrix"))
colnames(asv_table) <- gsub("-","_",colnames(asv_table))
asv_table
taxa_table <- read.csv("results_deblur/2/taxa_2.csv", sep="\t", row.names = 1)
taxa_table
asv_table["ASV"] <- taxa_table["ASV"]
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting2_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,41])
colnames(mock_b_dist) <- "SETTING 2"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- merge(mock_b_dist,bray_curtis_distances, by='row.names')
rownames(bray_curtis_distances) <- bray_curtis_distances$Row.names
bray_curtis_distances <- bray_curtis_distances[,-1]
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
my_biom = read_biom("results_deblur/4/feature-table.biom")
## Warning in strsplit(conditionMessage(e), "\n"): unable to translate 'lexical error: invalid char in json text.
## <89>HDF (right here) ------^
## ' to a wide string
## Warning in strsplit(conditionMessage(e), "\n"): input string 1 is invalid
my_biom
## biom object.
## type: OTU table
## matrix_type: dense
## 378 rows and 40 columns
asv_table <- as.data.frame(as(biom_data(my_biom), "matrix"))
colnames(asv_table) <- gsub("-","_",colnames(asv_table))
asv_table
taxa_table <- read.csv("results_deblur/4/taxa_4.csv", sep="\t", row.names = 1)
taxa_table
asv_table["ASV"] <- taxa_table["ASV"]
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting4_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,41])
colnames(mock_b_dist) <- "SETTING 4"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- merge(mock_b_dist,bray_curtis_distances, by='row.names')
rownames(bray_curtis_distances) <- bray_curtis_distances$Row.names
bray_curtis_distances <- bray_curtis_distances[,-1]
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
my_biom = read_biom("results_deblur/5/feature-table.biom")
## Warning in strsplit(conditionMessage(e), "\n"): unable to translate 'lexical error: invalid char in json text.
## <89>HDF (right here) ------^
## ' to a wide string
## Warning in strsplit(conditionMessage(e), "\n"): input string 1 is invalid
my_biom
## biom object.
## type: OTU table
## matrix_type: dense
## 378 rows and 40 columns
asv_table <- as.data.frame(as(biom_data(my_biom), "matrix"))
colnames(asv_table) <- gsub("-","_",colnames(asv_table))
asv_table
taxa_table <- read.csv("results_deblur/5/taxa_5.csv", sep="\t", row.names = 1)
taxa_table
asv_table["ASV"] <- taxa_table["ASV"]
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting5_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,41])
colnames(mock_b_dist) <- "SETTING 5"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- merge(mock_b_dist,bray_curtis_distances, by='row.names')
rownames(bray_curtis_distances) <- bray_curtis_distances$Row.names
bray_curtis_distances <- bray_curtis_distances[,-1]
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
my_biom = read_biom("results_deblur/6/feature-table.biom")
## Warning in strsplit(conditionMessage(e), "\n"): unable to translate 'lexical error: invalid char in json text.
## <89>HDF (right here) ------^
## ' to a wide string
## Warning in strsplit(conditionMessage(e), "\n"): input string 1 is invalid
my_biom
## biom object.
## type: OTU table
## matrix_type: dense
## 367 rows and 40 columns
asv_table <- as.data.frame(as(biom_data(my_biom), "matrix"))
colnames(asv_table) <- gsub("-","_",colnames(asv_table))
asv_table
taxa_table <- read.csv("results_deblur/6/taxa_6.csv", sep="\t", row.names = 1)
taxa_table
asv_table["ASV"] <- taxa_table["ASV"]
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting6_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,41])
colnames(mock_b_dist) <- "SETTING 6"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- merge(mock_b_dist,bray_curtis_distances, by='row.names')
rownames(bray_curtis_distances) <- bray_curtis_distances$Row.names
bray_curtis_distances <- bray_curtis_distances[,-1]
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
my_biom = read_biom("results_deblur/8/feature-table.biom")
## Warning in strsplit(conditionMessage(e), "\n"): unable to translate 'lexical error: invalid char in json text.
## <89>HDF (right here) ------^
## ' to a wide string
## Warning in strsplit(conditionMessage(e), "\n"): input string 1 is invalid
my_biom
## biom object.
## type: OTU table
## matrix_type: dense
## 367 rows and 40 columns
asv_table <- as.data.frame(as(biom_data(my_biom), "matrix"))
colnames(asv_table) <- gsub("-","_",colnames(asv_table))
asv_table
taxa_table <- read.csv("results_deblur/8/taxa_8.csv", sep=",")
taxa_table
asv_table["ASV"] <- taxa_table["ASV"]
READ COUNTS
read_counts(asv_table,input_numbers)
PHYLOSEQ OBJECT, FILTERING
# phyloseq object
mock_object <- construct_phyloseq(asv_table, taxa_table)
# filtering
mock_object_filt <- abundance_filtering(mock_object)
#nornalization
mock_object_n <- normalization(mock_object_filt)
BARPLOT
# merging samples
mock_with_reference <- merge_phyloseq(mock_object_n, zymo_object)
# plot with reference
p1 <- plot_bar(mock_with_reference, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
p2 <- plot_bar(mock_object, fill = "Genus") +
geom_bar(aes(color=Genus, fill=Genus), stat="identity", position="stack") +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
theme(legend.position = "none")
ggarrange(p1,p2,
ncol = 1, nrow = 2)
MERGING AND BRAY CURTIS
# Genus merging
mock_zymo_genus <- mock_zymo_genus_merging(taxa_table,asv_table,zymo_genus)
## `summarise()` has grouped output by 'Domain', 'Phylum', 'Class', 'Order',
## 'Family'. You can override using the `.groups` argument.
setting8_genus <- mock_zymo_genus
# Bray Curtis distance
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(as.data.frame(mock_zymo_genus[,7:ncol(mock_zymo_genus)])), method="bray")))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,41])
colnames(mock_b_dist) <- "SETTING 8"
rownames(mock_b_dist) <- samples
bray_curtis_distances <- merge(mock_b_dist,bray_curtis_distances, by='row.names')
rownames(bray_curtis_distances) <- bray_curtis_distances$Row.names
bray_curtis_distances <- bray_curtis_distances[,-1]
PRECISION AND RECALL
performance_mat <- rbind(performance_mat,c(precision(mock_object_filt,zymo_taxa), recall(mock_object_filt,zymo_taxa)))
On the heatmap, there is the Bray-Curtis distance for each sample and the Deblur settings with which they were processed. The settings are color coded. The smaller the Bray-Curtis distance, the more similar the sample is to the reference. Negative controls are less similar, therefore their distance is larger.
Similar results as VSEARCH can be observed, as Deblur consistent in its results across different settings and there is no mock community sample that has a significantly higher Bray-Curtis distance (mock samples in blue colors). However, setting S1 reports considerably higher Bray-Curtis distance, which can be the result of the high truncation length leading to low resolution of Deblur.
annotation_df <- data.frame(max_diffs = c(2,2,3,2,2,2),fastq_minovlen = c(10,12,12,12,12,10),
fastq_trunclen = c(250,350,350,350,400,400),
row.names = paste("SETTING",c(1,2,4,5,6,8)))
bray_curtis_distances$samples = rownames(bray_curtis_distances)
pheatmap(bray_curtis_distances[-1,-which(names(bray_curtis_distances) == "samples")],cluster_rows = FALSE, cellheight = 6, annotation_col = annotation_df, )
MERGING ALL SETTINGS TOGETHER
colnames(setting1_genus)[7:ncol(setting1_genus)] <- paste("S1", colnames(setting1_genus)[7:(ncol(setting1_genus))])
colnames(setting2_genus)[7:ncol(setting2_genus)] <- paste("S2", colnames(setting2_genus)[7:ncol(setting2_genus)])
colnames(setting4_genus)[7:ncol(setting4_genus)] <- paste("S4", colnames(setting4_genus)[7:ncol(setting4_genus)])
colnames(setting5_genus)[7:ncol(setting5_genus)] <- paste("S5", colnames(setting5_genus)[7:ncol(setting5_genus)])
colnames(setting6_genus)[7:ncol(setting6_genus)] <- paste("S6", colnames(setting6_genus)[7:ncol(setting6_genus)])
colnames(setting8_genus)[7:ncol(setting8_genus)] <- paste("S8", colnames(setting8_genus)[7:ncol(setting8_genus)])
all_settings_genus <- merge(setting1_genus,setting2_genus[,-which(colnames(setting2_genus)=="S2 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting4_genus[,-which(colnames(setting4_genus)=="S4 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting5_genus[,-which(colnames(setting5_genus)=="S5 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting6_genus[,-which(colnames(setting6_genus)=="S6 Mock.Reference")], all=TRUE)
all_settings_genus <- merge(all_settings_genus,setting8_genus[,-which(colnames(setting8_genus)=="S8 Mock.Reference")], all=TRUE)
all_settings_genus[is.na(all_settings_genus)] <- 0
#phyloseq object
all_settings_genus_asv <- all_settings_genus[,7:ncol(all_settings_genus)]
all_settings_genus_asv["ASV"] <- paste("ASV",1:nrow(all_settings_genus_asv))
all_settings_genus_taxa <- all_settings_genus[,1:6]
all_settings_genus_taxa["ASV"] <- paste("ASV",1:nrow(all_settings_genus_taxa))
all_settings_genus_object <- construct_phyloseq(all_settings_genus_asv, all_settings_genus_taxa)
ORDINATION - PCoA
Here, the Principal Coordinate Analysis (PCoA) is used to visualize the distance matrix. Mock community samples cluster separately from negative controls.
Compared to PCoA from the DADA2 pipeline, it can be seen that the mock community samples are more similar to each other (similar as VSEARCH).
ord_b <- ordinate(all_settings_genus_object, method = "PCoA", distance = "bray")
data_for_pca_bray <- as.data.frame(t(all_settings_genus_object@otu_table))
data_for_pca_bray["setting"] <- substring(rownames(data_for_pca_bray),1,3)
data_for_pca_bray["sample"] <- regmatches(rownames(data_for_pca_bray), regexpr("(Mock)|(NC)|(Mock.Reference)", rownames(data_for_pca_bray)))
data_for_pca_bray[data_for_pca_bray$sample=="Mock.Reference","sample"] <- "ZYMO REFERENCE"
imp_bray <- ord_b$values$Relative_eig
pca_bray <- ord_b$vectors
ggplot(data_for_pca_bray, aes(x=pca_bray[,1],y=pca_bray[,2], col= sample)) +
geom_point(show.legend =TRUE) +
xlab(paste("PCo1 ", "(",round(imp_bray[1]*100,2),"%", ")", sep=""))+
ylab(paste("PCo2 ", "(",round(imp_bray[2]*100,2),"%", ")", sep=""))+
theme_bw() +
labs(title="PCoA on Bray-Curtis")
Next, the mock community samples without the negative control presence are analyzed, to investigate the patterns in taxonomic composition revealed by individual bioinformatics pipeline.
Here, S1 setting forms the individual cluster, however, unlike VSEARCH, the distances are higher as expected due to lower resolution for the denoising algorithm and the taxonomic classifier.
# WITHOUT NC
all_settings_genus_asv_mock <- all_settings_genus_asv[,grep("(Mock)|(ASV)",colnames(all_settings_genus_asv))]
to_filter <- (rowSums((all_settings_genus_asv_mock[,-which(names(all_settings_genus_asv_mock) == "ASV")])>0))>0
all_settings_genus_asv_mock <- all_settings_genus_asv_mock[to_filter,]
all_settings_genus_taxa_mock <- all_settings_genus_taxa[to_filter,]
# BRAY CURTIS
mock_b_dist <- as.data.frame(as.matrix(vegdist(t(all_settings_genus_asv_mock[,-which(names(all_settings_genus_asv_mock) == "ASV")])), method="bray"))
samples <- rownames(mock_b_dist)
mock_b_dist <- as.data.frame(mock_b_dist[,"S1 Mock.Reference"])
rownames(mock_b_dist) <- samples
colnames(mock_b_dist) <- "bray_curtis"
all_settings_genus_mock_object <- construct_phyloseq(all_settings_genus_asv_mock, all_settings_genus_taxa_mock)
all_settings_genus_mock_object_deblur <- all_settings_genus_mock_object
# ORDINATION
ord_b <- ordinate(all_settings_genus_mock_object, method = "PCoA", distance = "bray")
data_for_pca_bray <- as.data.frame(t(all_settings_genus_mock_object@otu_table))
data_for_pca_bray["Setting"] <- substring(rownames(data_for_pca_bray),1,3)
data_for_pca_bray["sample"] <- regmatches(rownames(data_for_pca_bray), regexpr("(Mock)|(Mock.Reference)", rownames(data_for_pca_bray)))
data_for_pca_bray[data_for_pca_bray$sample=="Mock.Reference","sample"] <- "ZYMO REFERENCE"
data_for_pca_bray[data_for_pca_bray$sample=="ZYMO REFERENCE","Setting"] <- "ZYMO REFERENCE"
imp_bray <- ord_b$values$Relative_eig
pca_bray <- ord_b$vectors
ggplot(data_for_pca_bray, aes(x=pca_bray[,1],y=pca_bray[,2], col= Setting)) +
geom_point(show.legend =TRUE) +
xlab(paste("PCo1 ", "(",round(imp_bray[1]*100,2),"%", ")", sep=""))+
ylab(paste("PCo2 ", "(",round(imp_bray[2]*100,2),"%", ")", sep=""))+
theme_bw() +
labs(title="PCoA on Bray-Curtis")
rownames(mock_b_dist)[23] <- "ZYMO REFERENCE"
mock_b_dist["Setting"] <- data_for_pca_bray$Setting
mock_b_dist_deblur <- mock_b_dist
ggplot(data=mock_b_dist,aes(x=Setting, y=bray_curtis, fill=Setting)) +
geom_boxplot() +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 0))
The heatmap visualizes the same information as the boxplot, but for each sample separately for a more detailed overview.
mock_b_dist_settings <- data.frame(
row.names = unique(regmatches(rownames(mock_b_dist), regexpr("Mock\\d+_\\d+", rownames(mock_b_dist)))),
`SETTING 1` = mock_b_dist[grep("S1 ",rownames(mock_b_dist)),"bray_curtis"],
`SETTING 2` = mock_b_dist[grep("S2 ",rownames(mock_b_dist)),"bray_curtis"],
`SETTING 4` = mock_b_dist[grep("S4 ",rownames(mock_b_dist)),"bray_curtis"],
`SETTING 5` = mock_b_dist[grep("S5 ",rownames(mock_b_dist)),"bray_curtis"],
`SETTING 6` = mock_b_dist[grep("S6 ",rownames(mock_b_dist)),"bray_curtis"],
`SETTING 8` = mock_b_dist[grep("S6 ",rownames(mock_b_dist)),"bray_curtis"])
mock_b_dist_settings_deblur <- mock_b_dist_settings
annotation_df <- data.frame(max_diffs = c(2,2,3,2,2,2),fastq_minovlen = c(10,12,12,12,12,10),
fastq_trunclen = c(250,350,350,350,400,400),
row.names = paste0("SETTING.",c(1,2,4,5,6,8)))
pheatmap(mock_b_dist_settings,cluster_rows = FALSE, cellheight = 10, annotation_col = annotation_df)
pheatmap(mock_b_dist_settings,cluster_rows = FALSE, cellheight = 10, kmeans_k = 1,annotation_col = annotation_df)
print("Kruskal Wallis")
## [1] "Kruskal Wallis"
kruskal.test(bray_curtis ~ Setting, data = mock_b_dist)
##
## Kruskal-Wallis rank sum test
##
## data: bray_curtis by Setting
## Kruskal-Wallis chi-squared = 70.47, df = 6, p-value = 3.275e-13
print("Post-hoc")
## [1] "Post-hoc"
dunnTest(bray_curtis ~ Setting,
data=mock_b_dist,
method="bh")
## Warning: Setting was coerced to a factor.
## Dunn (1964) Kruskal-Wallis multiple comparison
## p-values adjusted with the Benjamini-Hochberg method.
## Comparison Z P.unadj P.adj
## 1 S1 - S2 6.16128153 7.215857e-10 5.051100e-09
## 2 S1 - S4 4.48306580 7.357823e-06 3.862857e-05
## 3 S2 - S4 -1.67821573 9.330499e-02 1.399575e-01
## 4 S1 - S5 6.23560810 4.500264e-10 4.725278e-09
## 5 S2 - S5 0.07432657 9.407505e-01 9.407505e-01
## 6 S4 - S5 1.75254230 7.968060e-02 1.287148e-01
## 7 S1 - S6 4.19749529 2.698833e-05 1.133510e-04
## 8 S2 - S6 -1.96378624 4.955489e-02 1.040653e-01
## 9 S4 - S6 -0.28557051 7.752071e-01 8.139674e-01
## 10 S5 - S6 -2.03811281 4.153865e-02 9.692352e-02
## 11 S1 - S8 7.32312319 2.422652e-13 5.087569e-12
## 12 S2 - S8 1.16184166 2.452998e-01 3.030174e-01
## 13 S4 - S8 2.84005739 4.510542e-03 1.184017e-02
## 14 S5 - S8 1.08751509 2.768092e-01 3.229441e-01
## 15 S6 - S8 3.12562790 1.774259e-03 6.209906e-03
## 16 S1 - ZYMO REFERENCE 3.08347573 2.045978e-03 6.137934e-03
## 17 S2 - ZYMO REFERENCE 1.26661293 2.052937e-01 2.874112e-01
## 18 S4 - ZYMO REFERENCE 1.76149175 7.815521e-02 1.367716e-01
## 19 S5 - ZYMO REFERENCE 1.24469522 2.132438e-01 2.798825e-01
## 20 S6 - ZYMO REFERENCE 1.84570190 6.493550e-02 1.239678e-01
## 21 S8 - ZYMO REFERENCE 0.92400451 3.554840e-01 3.929033e-01
This section combines all results of individual pipelines.
taxa_reads_table_dada2 <- merge(all_settings_genus_mock_object_dada2@tax_table,all_settings_genus_mock_object_dada2@otu_table, by='row.names', all=TRUE)
colnames(taxa_reads_table_dada2)[8:ncol(taxa_reads_table_dada2)] <- paste("DADA2", colnames(taxa_reads_table_dada2)[8:ncol(taxa_reads_table_dada2)])
taxa_reads_table_vsearch <- merge(all_settings_genus_mock_object_vsearch@tax_table,all_settings_genus_mock_object_vsearch@otu_table, by='row.names', all=TRUE)
colnames(taxa_reads_table_vsearch)[8:ncol(taxa_reads_table_vsearch)] <- paste("VSEARCH", colnames(taxa_reads_table_vsearch)[8:ncol(taxa_reads_table_vsearch)])
taxa_reads_table_deblur <- merge(all_settings_genus_mock_object_deblur@tax_table,all_settings_genus_mock_object_deblur@otu_table, by='row.names', all=TRUE)
colnames(taxa_reads_table_deblur) <- gsub(pattern = "-",replacement = "_",x = colnames(taxa_reads_table_deblur))
colnames(taxa_reads_table_deblur)[8:ncol(taxa_reads_table_deblur)] <- paste("DEBLUR", colnames(taxa_reads_table_deblur)[8:ncol(taxa_reads_table_deblur)])
all_settings_genus_mock_all <- merge(taxa_reads_table_dada2[,-1],taxa_reads_table_vsearch[,-1], all=TRUE)
all_settings_genus_mock_all <- merge(all_settings_genus_mock_all,taxa_reads_table_deblur[,-1], all=TRUE)
all_settings_genus_mock_all[is.na(all_settings_genus_mock_all)] <- 0
all_settings_genus_mock_all <- all_settings_genus_mock_all[,-which(names(all_settings_genus_mock_all) == "DEBLUR S1 Mock.Reference")]
all_settings_genus_mock_all <- all_settings_genus_mock_all[,-which(names(all_settings_genus_mock_all) == "VSEARCH S1 Mock.Reference")]
names(all_settings_genus_mock_all)[which(names(all_settings_genus_mock_all) == "DADA2 S1 Mock.Reference")] <- "ZYMO REFERENCE"
#phyloseq object
all_settings_genus_asv_all <- all_settings_genus_mock_all[,7:ncol(all_settings_genus_mock_all)]
all_settings_genus_asv_all["ASV"] <- paste("ASV",1:nrow(all_settings_genus_asv_all))
all_settings_genus_taxa_all <- all_settings_genus_mock_all[,1:6]
all_settings_genus_taxa_all["ASV"] <- paste("ASV",1:nrow(all_settings_genus_taxa_all))
all_settings_genus_object_all <- construct_phyloseq(all_settings_genus_asv_all, all_settings_genus_taxa_all)
ord_b <- ordinate(all_settings_genus_object_all, method = "PCoA", distance = "bray")
data_for_pca_bray <- as.data.frame(t(all_settings_genus_object_all@otu_table))
data_for_pca_bray["Setting"] <- regmatches(rownames(data_for_pca_bray), regexpr("(((DADA2)|(DEBLUR)|(VSEARCH)) S\\d+)|(ZYMO REFERENCE)", rownames(data_for_pca_bray)))
imp_bray <- ord_b$values$Relative_eig
pca_bray <- ord_b$vectors
ggplot(data_for_pca_bray, aes(x=pca_bray[,1],y=pca_bray[,2], col= Setting)) +
geom_point(show.legend =TRUE) +
xlab(paste("PCo1 ", "(",round(imp_bray[1]*100,2),"%", ")", sep=""))+
ylab(paste("PCo2 ", "(",round(imp_bray[2]*100,2),"%", ")", sep=""))+
theme_bw() +
labs(title="PCoA on Bray-Curtis") + scale_color_manual(values=c(
"#1E75B0", "#ABC3E3", "#FA7D0E","#FAB776",
"#2B9D2B", "#D90368", "#D22627","#FA9593",
"#9165B9", "#FAF33E", "#89544A","#C09991",
"#DF75BE", "#F2B2CE", "#7D7D7D","#C3C3C3",
"#B8B921", "#D7D78A", "#17BACB","#9BD6E1",
"#FFD131", "#F4AC32", "#DBB3B1","#A5BE00",
"#3D52D5", "#FFEEDD","#1EFC1E","#C1ADD1",
"#95DB87","#000000","#89544A"))
mock_b_dist_dada2["Setting"] <- paste("DADA2",mock_b_dist_dada2$Setting)
mock_b_dist_vsearch["Setting"] <- paste("VSEARCH",mock_b_dist_vsearch$Setting)
mock_b_dist_deblur["Setting"] <- paste("DEBLUR",mock_b_dist_deblur$Setting)
mock_b_dist_vsearch <- mock_b_dist_vsearch[-which(rownames(mock_b_dist_vsearch) == "ZYMO REFERENCE"),]
mock_b_dist_deblur <- mock_b_dist_deblur[-which(rownames(mock_b_dist_deblur) == "ZYMO REFERENCE"),]
mock_b_dist_all <- rbind(mock_b_dist_dada2,mock_b_dist_vsearch)
mock_b_dist_all <- rbind(mock_b_dist_all,mock_b_dist_deblur)
mock_b_dist_all <- mock_b_dist_all[!(mock_b_dist_all$Setting == "DEBLURS5"),]
mock_b_dist_all[which(rownames(mock_b_dist_all) == "ZYMO REFERENCE"),"Setting"] <- "ZYMO REFERENCE"
ggplot(data=mock_b_dist_all,aes(x=Setting, y=bray_curtis, fill=Setting)) +
geom_boxplot() + scale_fill_manual(values=c(
"#1E75B0", "#ABC3E3", "#FA7D0E","#FAB776",
"#2B9D2B", "#D90368", "#D22627","#FA9593",
"#9165B9", "#FAF33E", "#89544A","#C09991",
"#DF75BE", "#F2B2CE", "#7D7D7D","#C3C3C3",
"#B8B921", "#D7D78A", "#17BACB","#9BD6E1",
"#FFD131", "#F4AC32", "#DBB3B1","#A5BE00",
"#3D52D5", "#FFEEDD","#1EFC1E","#C1ADD1",
"#95DB87","#000000","#89544A")) + scale_x_discrete(guide = guide_axis(angle = 90)) +theme_bw() + theme(legend.position = "none") + ylab(label = "BRAY-CURTIS DISTANCE") + xlab(label = "SETTING")
colnames(mock_b_dist_settings_dada2) <- paste("DADA2",colnames(mock_b_dist_settings_dada2))
colnames(mock_b_dist_settings_vsearch) <- paste("VSEARCH",colnames(mock_b_dist_settings_vsearch))
colnames(mock_b_dist_settings_deblur) <- paste("DEBLUR",colnames(mock_b_dist_settings_deblur))
rownames(mock_b_dist_settings_deblur) <- gsub("-","_",rownames(mock_b_dist_settings_deblur))
mock_b_dist_settings_all <- cbind(mock_b_dist_settings_dada2,mock_b_dist_settings_vsearch)
mock_b_dist_settings_all <- cbind(mock_b_dist_settings_all,mock_b_dist_settings_deblur)
colnames(mock_b_dist_settings_all) <- gsub("ETTING[.]","",colnames(mock_b_dist_settings_all))
pheatmap(mock_b_dist_settings_all,cluster_rows = FALSE, cellheight = 10, angle_col = 90)
pheatmap(mock_b_dist_settings_all,cluster_rows = FALSE, cellheight = 10, kmeans_k = 1)
Here the average performance (precision, recall and F1 score) for each pipeline is plotted.
colnames(performance_mat) <- c("Precision", "Recall")
rownames(performance_mat) <- c(paste0("DADA2 S",1:14),paste0("VSEARCH S",1:9),paste0("DEBLUR S", c(1,2,4,5,6,8)))
performance_mat["Pipeline"] <- rownames(performance_mat)
ggplot(performance_mat, aes(x=reorder(Pipeline,Precision), y=Precision, fill="steelblue")) +
geom_bar(stat="identity",fill="steelblue")+theme_bw() + scale_x_discrete(guide = guide_axis(angle = 90)) +theme_bw() + theme(legend.position = "none") +
xlab("Pipeline")
ggplot(performance_mat, aes(x=reorder(Pipeline,Recall), y=Recall, fill="steelblue")) +
geom_bar(stat="identity",fill="steelblue")+theme_bw() + scale_x_discrete(guide = guide_axis(angle = 90)) +theme_bw() + theme(legend.position = "none") +
xlab("Pipeline")
performance_mat$F1 <- (2*performance_mat$Precision*performance_mat$Recall)/(performance_mat$Precision + performance_mat$Recall)
long_df <- melt(performance_mat)
## Using Pipeline as id variables
wide_width <- 0.9
thin_width <- 0.0
# Create the plot
ggplot(long_df, aes(x = Pipeline, y = value, fill = variable)) +
geom_bar(data = subset(long_df, variable == "F1"),
aes(y = value),
stat = "identity",
position = position_dodge(width = wide_width),
fill = "gray") +
geom_bar(data = subset(long_df, variable != "F1"),
aes(fill = variable, y = value),
stat = "identity",
position = position_dodge(width = thin_width)) +
theme_bw() +
scale_x_discrete(guide = guide_axis(angle = 90)) +
xlab("Pipeline") + ylab("Value") + labs(fill = "Variable Type") + theme(legend.position="bottom")
The best pipelines are selected according to the criteria:
performance_mat[performance_mat$Precision>0.95 &performance_mat$Recall > 0.75,]
best_pipelines <- performance_mat[performance_mat$Precision>0.95 &performance_mat$Recall > 0.75,"Pipeline"]
bray_best_pipelines <- c()
mock_b_dist_all$Setting <- gsub(" ","",mock_b_dist_all$Setting)
best_pipelines <- gsub(" ","",best_pipelines)
for (pipeline in best_pipelines){
bray_best_pipelines <- c(bray_best_pipelines,median(mock_b_dist_all[grep(paste0(pipeline,"$"),mock_b_dist_all$Setting),"bray_curtis"]))
}
best_pipelines[bray_best_pipelines<0.4]
## [1] "DADA2S2" "DADA2S3" "DADA2S4" "DADA2S6" "DADA2S8" "DADA2S10"
## [7] "DADA2S12" "DADA2S14" "VSEARCHS6" "VSEARCHS9" "DEBLURS2" "DEBLURS4"
## [13] "DEBLURS5" "DEBLURS6" "DEBLURS8"
FInally, a significant difference between the Bray-Curtis distances of tested pipelines was observed (P<0.001), with 226 of 406 pairs of pipelines being significant in post-hoc testing at the significance level of 0.05.
# kruskal wallis
kruskal.test(bray_curtis ~ Setting, data = mock_b_dist_all[-which(rownames(mock_b_dist_all)=="ZYMO REFERENCE"),])
##
## Kruskal-Wallis rank sum test
##
## data: bray_curtis by Setting
## Kruskal-Wallis chi-squared = 388.57, df = 28, p-value < 2.2e-16
pt <- dunnTest(bray_curtis ~ Setting,data = mock_b_dist_all[-which(rownames(mock_b_dist_all)=="ZYMO REFERENCE"),],
method="bh")
## Warning: Setting was coerced to a factor.
pt$res
pt$res[pt$res$P.adj <0.05,]
# Create a square matrix to store the adjusted p-values
num_groups <- length(unique(mock_b_dist_all$Setting)) - 1
result_matrix <- matrix(NA, nrow = num_groups, ncol = num_groups,
dimnames = list(gsub(" ","",(unique(mock_b_dist_all[-which(rownames(mock_b_dist_all)=="ZYMO REFERENCE"),]$Setting))), gsub(" ","",(unique(mock_b_dist_all[-which(rownames(mock_b_dist_all)=="ZYMO REFERENCE"),]$Setting)))))
adjusted_p_values <- pt$res$P.adj
comparison_pairs <- pt$res$Comparison
# Populate the matrix with adjusted p-values
for (i in 1:length(comparison_pairs)) {
pair <- strsplit(comparison_pairs[i], "-")[[1]]
group1 <- gsub(" ","",pair[1])
group2 <- gsub(" ","",pair[2])
result_matrix[group1, group2] <- adjusted_p_values[i]
result_matrix[group2, group1] <- adjusted_p_values[i]
}
Heatmap visualizes pairs between which there was a significant difference (red there was no difference, blue there was a difference).
heatmap_df <- result_matrix
heatmap_df[(result_matrix<0.05) & (result_matrix>0.01)] <- 0.5
heatmap_df[(result_matrix<=0.01)] <- 0.01
heatmap_df[(result_matrix>=0.05)] <- 1
pheatmap(heatmap_df,cluster_rows = FALSE, cellheight = 10, cluster_cols = FALSE)